In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sb
import json
import requests
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
from nltk.tokenize import sent_tokenize, wordpunct_tokenize
We will be using VADER (Valence Aware Dictionary and sEntiment Reasoner) for measuring sentiment, specifically in social media. This project has been integrated into the NLTK library, which we already have installed!
The methodology the authors used:
Sentiment ratings from 10 independent human raters (all pre-screened, trained, and quality checked for optimal inter-rater reliability). Over 9,000 token features were rated on a scale from "[–4] Extremely Negative" to "[4] Extremely Positive", with allowance for "[0] Neutral (or Neither, N/A)". We kept every lexical feature that had a non-zero mean rating, and whose standard deviation was less than 2.5 as determined by the aggregate of those ten independent raters. This left us with just over 7,500 lexical features with validated valence scores that indicated both the sentiment polarity (positive/negative), and the sentiment intensity on a scale from –4 to +4. For example, the word "okay" has a positive valence of 0.9, "good" is 1.9, and "great" is 3.1, whereas "horrible" is –2.5, the frowning emoticon :( is –2.2, and "sucks" and its slang derivative "sux" are both –1.5.
We can access the lexicon that maps tokens to their average sentiment.
In [ ]:
lexicon_df = pd.read_csv('https://github.com/cjhutto/vaderSentiment/raw/master/vaderSentiment/vader_lexicon.txt',sep='\t',header=None)
lexicon_df.columns = ['token','mean sentiment','std. dev.','human ratings']
lexicon_dict = lexicon_df.set_index('token')['mean sentiment'].to_dict()
lexicon_df.loc[1010:1020]
Traditional sentiment analysis tools like LIWC use a "bag of words" approach that is indifferent to word order, negations, etc. Let's use the word-level sentiment scores from VADER's lexicon, but use them in a super-naive "bag of words" approach.
In the case of these two sentences, we tokenize each sentence into words, look up each word's sentiment score in lexicon_dict, print each word's score, and then add up the scores in sentiment_score_words1 and sentiment_score_words2. We see that this super-naive bag of words approach gives both sentences the same sentiment score.
In [ ]:
words1 = wordpunct_tokenize("The book was good.")
words2 = wordpunct_tokenize("The book was not good.")
sentiment_score_words1 = 0
for w in words1:
    lexicon_score = lexicon_dict.get(w.lower(),0)
    print(w,lexicon_score)
    sentiment_score_words1 += lexicon_score
print('\n')
sentiment_score_words2 = 0
for w in words2:
    lexicon_score = lexicon_dict.get(w.lower(),0)
    print(w,lexicon_score)
    sentiment_score_words2 += lexicon_score
print('\n')
print("The book was good. ==>",sentiment_score_words1)
print("The book was not good. ==>",sentiment_score_words2)
Here are some sentences that have pretty clear valence for human readers, but that computers are likely to struggle with. Extending the approach from above to these sentences, we can see how the super-naive bag of words approach performs.
An easy phrase like "The book was good." is correctly classified as positive, but "The book was not good." is incorrectly classified as positive, "The book was not terrible." is incorrectly classified as negative, and "The book was kind of good." is incorrectly given a higher sentiment score than "The book was good."
In [ ]:
sentences = ["The book was good.",
"The book was not good.",
"The book was not terrible.",
"The book was kind of good.",
"The book was good!",
"The book was good. :)",
"The book was good LOL"
]
for s in sentences:
    words = wordpunct_tokenize(s)
    sentiment_score = 0
    for word in words:
        lexicon_score = lexicon_dict.get(word.lower(),0)
        sentiment_score += lexicon_score
    print("\"{0}\" ==> {1}".format(s, str(sentiment_score)))
Compared to traditional "bag of words" approaches to sentiment analysis, VADER is sensitive to word order (handling negations and intensifiers); includes emoticons (:-)), slang (meh), and initialisms (lol); and accounts for punctuation cues.
We initialize the analyzer model from VADER's SentimentIntensityAnalyzer class (imported above) and pass a sentence to its polarity_scores method. VADER takes care of a lot of the tokenizing, casing, and stemming, so we can give it uncleaned sentences.
The polarity_scores method returns a dictionary containing the proportions of the text that are negative, neutral, and positive. The most important value for our purposes is the "compound" value, which is the normalized sum of the word-level valences: this is the sentiment score of the sentence, where -1 is extremely negative and 1 is extremely positive.
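For intuition: VADER sums the (adjusted) valences of the scored words in a sentence and squashes that sum into the range (-1, 1) to get "compound". Here is a minimal sketch of that normalization; the α = 15 default is taken from the vaderSentiment source, so treat this as an illustration rather than the library's exact code path.
In [ ]:
import math
def normalize_compound(valence_sum, alpha=15):
    # Squash an unbounded sum of word valences into the (-1, 1) "compound" range
    return valence_sum / math.sqrt(valence_sum**2 + alpha)
# "good" has a lexicon valence of 1.9, which normalizes to roughly 0.44
normalize_compound(1.9)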
If we apply this to the two sentences, we can see VADER is much better at identifying the proper sentiment.
In [ ]:
analyzer.polarity_scores("The book was good.")
In [ ]:
analyzer.polarity_scores("The book was not good.")
Let's run VADER on all of the example sentences and see what their compound sentiment scores are. Emoticons and initialisms appropriately boost the sentiment, while negations like "not" and hedges like "kind of" appropriately reduce the valence.
In [ ]:
for s in sentences:
    scores = analyzer.polarity_scores(s)
    print("\"{0}\" ==> {1}".format(s, scores['compound']))
In [ ]:
with open('potus_wiki_bios.json','r') as f:
    bios = json.load(f)
print("There are {0} biographies of presidents.".format(len(bios)))
We need to tokenize each biography into sentences. Let's use Grover Cleveland as an example.
In [ ]:
sentences = sent_tokenize(bios['Grover Cleveland'])
sentences
Using VADER's polarity_scores method on the 76th sentence in his biography, we see a "compound" score of 0.4767. This is a profoundly sad and negative sentence, but VADER still classified it as primarily positive. Sentiment analysis is still far from perfect in many cases, but averaging over many sentences should still capture each biography's central tendency.
In [ ]:
print(sentences[75])
analyzer.polarity_scores(sentences[75])
This double for
loop goes through every president's biography, breaks the biography into sentences, and for every one of these sentences, computes a compound sentiment score.
In [ ]:
potus_sentiments = {}
for president,bio in bios.items():
    potus_sentiments[president] = []
    sentences = sent_tokenize(bio)
    for s in sentences:
        polarity_scores = analyzer.polarity_scores(s)
        compound_polarity = polarity_scores['compound']
        potus_sentiments[president].append(compound_polarity)
Once we have a sentiment score for each presidential sentence, we can compute the average across all sentences within a biography, and sort the resulting average presidential sentiment.
In [ ]:
mean_potus_sentiment = pd.Series({potus:np.mean(scores) for potus,scores in potus_sentiments.items()})
mean_potus_sentiment.sort_values(ascending=False)
We can also plot the distribution of sentence sentiments across biographies. Each line is a single president's distribution of sentence sentiments.
In [ ]:
f,ax = plt.subplots(1,1)
ax.axvline(x=0,c='k',lw=1,ls='--')
ax.set_xlabel('Compound sentiment score')
ax.set_title('Distribution of POTUS sentence sentiments')
ax.set_xlim((-1.1,1.1))
for potus,scores in potus_sentiments.items():
    _s = pd.Series(scores)
    _s.plot.kde(ax=ax,c='blue',alpha=.25)
Alternatively, we can cumulatively add up the sentiments sentence-by-sentence for each presidential biography. In other words, how does the cumulative sentiment of a presidential biography change over the course of the article (childhood to education to early career to administration to post-POTUS life)?
In [ ]:
f,ax = plt.subplots(1,1)
ax.set_xlabel('Sentence index')
ax.set_ylabel('Cumulative compound sentiment')
ax.set_title('POTUS sentiment through article')
ax.axhline(y=0,c='k',lw=1,ls='--')
for potus,scores in potus_sentiments.items():
    #_s = pd.Series(np.cumsum(scores)/np.arange(1,len(scores)+1))
    _s = pd.Series(np.cumsum(scores))
    _s.plot(ax=ax,label=potus,c='blue',alpha=.25)
#f.legend(ncol=4)
In [ ]:
pd.Series({potus:np.sum(scores) for potus,scores in potus_sentiments.items()}).sort_values(ascending=False)
The Trump Twitter Archive maintains an up-to-date archive of all the tweets from @realDonaldTrump. I broke Twitter's TOS and screen-scraped all of the historical tweets (but not retweets) generated by this account, as well as those from Hillary Clinton, Mitt Romney, and Barack Obama.
In [ ]:
with open('cleaned_tweets_trump.json','r') as f:
    trump_tweets = json.load(f)
with open('cleaned_tweets_clinton.json','r') as f:
    clinton_tweets = json.load(f)
with open('cleaned_tweets_romney.json','r') as f:
    romney_tweets = json.load(f)
with open('cleaned_tweets_obama.json','r') as f:
    obama_tweets = json.load(f)
Inspect a single tweet (these aren't in chronological order... yet).
In [ ]:
trump_tweets[0]
Convert this (flat) JSON data to a pandas DataFrame so it's easier to manipulate and visualize. We'll also use this new_tweet_features function to generate some new features (date, year, hour, weekday, word count) that will be useful.
In [ ]:
def new_tweet_features(df):
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df['date'] = pd.to_datetime(df['date'])
    df['weekday'] = df['date'].apply(lambda x:x.weekday())
    df['year'] = df['date'].apply(lambda x:x.year)
    df['hour'] = df['timestamp'].apply(lambda x:x.hour)
    df['wordcount'] = df['text'].apply(lambda x:len(wordpunct_tokenize(x)))
    df = df.sort_values('timestamp')
    return df
trump_tweets_df = pd.DataFrame(trump_tweets)
trump_tweets_df = new_tweet_features(trump_tweets_df)
trump_tweets_df.head()
A well-known blogpost by David Robinson explored the "source" field in a tweet as a valuable instrument for differentiating between the people communicating through the "realDonaldTrump" account: "Android tweets are angrier and more negative."
President Trump adopted an iPhone in 2017.
We can make a plot by year. We'll facet (using "hue") by the source of the tweet.
In [ ]:
top_sources = ['Twitter for iPhone','Twitter for Android','Twitter Web Client']
sb.catplot(x = 'year',
y = 'sentiment',
hue = 'source',
data = trump_tweets_df,
hue_order = top_sources,
kind = 'point',
aspect = 2,
dodge = .25
)
We can make a plot by hour of the day. We'll facet (using "hue") by the source of the tweet.
In [ ]:
sb.catplot(x = 'hour',
y = 'sentiment',
hue = 'source',
data = trump_tweets_df,
hue_order = top_sources,
kind = 'point',
aspect = 2,
dodge = .25
)
We can make a plot by hour of the day. We'll facet (using "hue") by tweets from different years.
In [ ]:
sb.catplot(x = 'hour',
y = 'sentiment',
hue = 'year',
data = trump_tweets_df,
hue_order = [2015,2016,2017,2018],
kind = 'point',
aspect = 2,
dodge = .25
)
Let's make a plot of sentiment by day of the week (0 = Monday, 6 = Sunday).
In [ ]:
sb.catplot(x = 'weekday',
y = 'sentiment',
hue = 'year',
data = trump_tweets_df,
hue_order = [2015,2016,2017,2018],
kind = 'point',
aspect = 2,
dodge = .25
)
We can perform a crosstab on the data to get a count of tweets by hour of day and day of week since he started his 2016 campaign.
In [ ]:
# Boolean index to only get rows of data after June 16, 2015
pres_trump_tweets = trump_tweets_df[trump_tweets_df['timestamp'] > pd.Timestamp('2015-06-16')]
# Cross-tab by hour and weekday
ct_count = pd.crosstab(index=pres_trump_tweets['hour'],
columns=pres_trump_tweets['weekday'],
)
# Plot the data
f,ax = plt.subplots(1,1,figsize=(7,12))
sb.heatmap(ct_count,ax=ax,cmap='rainbow',square=True)
We could also plot a heatmap of Trump's Twitter sentiment by weekday and hour since he started his campaign.
In [ ]:
# Crosstab by hour and weekday, but values are average sentiment (not counts from before)
ct_sentiment = pd.crosstab(index=pres_trump_tweets['hour'],
columns=pres_trump_tweets['weekday'],
values=pres_trump_tweets['sentiment'],
aggfunc=np.mean
)
# Plot the data
f,ax = plt.subplots(1,1,figsize=(7,12))
sb.heatmap(ct_sentiment,ax=ax,cmap='rainbow_r',square=True)
In [ ]:
fte_approvals = pd.read_csv('https://projects.fivethirtyeight.com/trump-approval-data/approval_topline.csv',parse_dates=['modeldate'])
fte_voters = fte_approvals[fte_approvals['subgroup'] == 'Voters']
fte_approvals.head()
Plot out the approval rating.
In [ ]:
f,ax = plt.subplots(1,1,figsize=(12,4))
fte_voters.plot.line(x='modeldate',y='approve_estimate',ax=ax)
Trump, like other Twitter users, can tweet multiple times in a day. We will perform a groupby-aggregation operation to compute the average sentiment of all tweets per day.
I've plotted the 90-day rolling average of sentiment and indicated five major events:
In [ ]:
daily_trump_sentiment = trump_tweets_df.groupby('date').agg({'sentiment':np.mean})['sentiment']
daily_trump_sentiment = daily_trump_sentiment.reindex(pd.date_range(daily_trump_sentiment.index.min(),daily_trump_sentiment.index.max()),fill_value=0)
f,ax = plt.subplots(1,1,figsize=(12,4))
daily_trump_sentiment.rolling(90).mean().plot(ax=ax,lw=3)
ax.set_ylim((-.1,.4))
ax.axhline(y=0,c='k',ls='--')
ax.axvline(x=pd.Timestamp('2011-05-01'),c='k') # WH Correspondent's Dinner
ax.axvline(x=pd.Timestamp('2015-06-16'),c='k') # Campaign launched
ax.axvline(x=pd.Timestamp('2016-11-08'),c='k') # Election
ax.axvline(x=pd.Timestamp('2017-01-20'),c='k') # Inauguration
ax.axvline(x=pd.Timestamp('2017-05-17'),c='k') # Special Investigator appointed
ax.set_ylabel('Mean sentiment')
I've made an approval_sentiment_df DataFrame by combining FiveThirtyEight's approval estimates with the sentiment scores we computed. We can plot the relationship between these values using seaborn's lmplot with a LOWESS fit and print the correlation.
In [ ]:
fte_approval = fte_voters.set_index('modeldate')['approve_estimate']
trump_sentiment = daily_trump_sentiment.loc[fte_approval.index]
approval_sentiment_df = pd.DataFrame({'Approval':fte_approval,'Sentiment':trump_sentiment})
corr = approval_sentiment_df.corr().loc['Approval','Sentiment']
print('Correlation is: {0:.4f}'.format(corr))
sb.lmplot(x = 'Approval',
y = 'Sentiment',
data = approval_sentiment_df,
lowess = True,
line_kws = {'lw':3,'color':'red'},
aspect = 2
)
In [ ]:
f,ax = plt.subplots(1,1,figsize=(12,4))
approval_sentiment_df.rolling(7).mean().plot(secondary_y='Sentiment',ax=ax)
In [ ]:
Step 2: Use seaborn's catplot to visualize some features of these candidates' Twitter histories (by year, by day of week, by hour of day, by device, etc.). How do these candidates' Twitter activities compare to Trump's? A sketch of one approach follows below.
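One possible sketch for this step, assuming the other candidates' cleaned tweet files have the same fields as the Trump file; the obama_tweets_df name is just an example, and the same pattern works for Clinton and Romney (swap 'year' for 'hour', 'weekday', or 'source' to explore the other features).
In [ ]:
# Build a features DataFrame for another candidate, reusing new_tweet_features from above
obama_tweets_df = pd.DataFrame(obama_tweets)
obama_tweets_df = new_tweet_features(obama_tweets_df)
# Mean sentiment by year, mirroring the Trump plots above
sb.catplot(x = 'year',
           y = 'sentiment',
           data = obama_tweets_df,
           kind = 'point',
           aspect = 2
          )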
In [ ]:
Step 3: Plot the moving average of other candidates' sentiment over time using a groupby-aggregation, reindexing, and rolling average like we did with Trump above (see the sketch below). What are some key changepoints?
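A minimal sketch for this step, following the same groupby-aggregation, reindexing, and rolling-average pattern used for Trump above; it assumes the obama_tweets_df built in the Step 2 sketch.
In [ ]:
# Average sentiment per day, filling days with no tweets with 0
daily_obama_sentiment = obama_tweets_df.groupby('date').agg({'sentiment':np.mean})['sentiment']
daily_obama_sentiment = daily_obama_sentiment.reindex(pd.date_range(daily_obama_sentiment.index.min(),daily_obama_sentiment.index.max()),fill_value=0)
# 90-day rolling average of daily sentiment
f,ax = plt.subplots(1,1,figsize=(12,4))
daily_obama_sentiment.rolling(90).mean().plot(ax=ax,lw=3)
ax.axhline(y=0,c='k',ls='--')
ax.set_ylabel('Mean sentiment')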
In [ ]:
Here is some code that uses GetOldTweets, a library with some complex dependencies, to flagrantly violate Twitter's Terms of Service and retrieve all of the tweets from a user's timeline (rather than only the 3,200 most recent). Twitter will eventually block your IP address if you run this code too much.
I am comfortable violating Twitter's ToS in order to retrieve an archive of presidential candidates' tweets that is not otherwise available.
In [ ]:
from GetOldTweets import got3
from datetime import datetime, timedelta
def tweet_converter(tweet):
    """
    Takes a tweet object generated by the GetOldTweets getTweets() function
    and converts it to a dictionary
    """
    d = {'author_id': tweet.author_id,
         'favorites': tweet.favorites,
         'formatted_date': tweet.formatted_date,
         'location': tweet.geo,
         'hashtags': tweet.hashtags,
         'id': tweet.id,
         'mentions': tweet.mentions,
         'permalink': tweet.permalink,
         'retweets': tweet.retweets,
         'text': tweet.text,
         'urls': tweet.urls,
         'username': tweet.username}
    return d
def tweet_grabber(screen_name,year,write_json=False):
    """
    Takes a screen_name as a string and a year and pulls all the tweets and replies from that account in that year
    Writes the tweets as two separate JSON files in the format 'tweets_<screenname>_<year>.json'
    and 'replies_<screenname>_<year>.json'
    Returns two dictionaries, containing the tweets and replies
    """
    year = int(year)
    # Get the tweets for a single year and convert the tweet objects to a dictionary
    tweet_criteria = got3.manager.TweetCriteria().setUsername(screen_name).setSince("{0}-01-01".format(year)).setUntil("{0}-01-01".format(year+1)).setMaxTweets(9999)
    tweets = got3.manager.TweetManager.getTweets(tweet_criteria)
    converted_tweets = [tweet_converter(t) for t in tweets]
    # Get the replies for a single year and convert the tweet objects to a dictionary
    replies_criteria = got3.manager.TweetCriteria().setUsername(screen_name).setSince("{0}-01-01".format(year)).setUntil("{0}-01-01".format(year+1)).setQuerySearch('filter:replies').setMaxTweets(9999)
    replies = got3.manager.TweetManager.getTweets(replies_criteria)
    converted_replies = [tweet_converter(t) for t in replies]
    # Only write to disk at this step if write_json is True
    if write_json:
        # Save to disk
        with open('tweets_{0}_{1}.json'.format(screen_name,year),'w') as f:
            json.dump(converted_tweets,f)
        with open('replies_{0}_{1}.json'.format(screen_name,year),'w') as f:
            json.dump(converted_replies,f)
    return converted_tweets, converted_replies
def tweet_grabbing_factory(screen_name,min_year=2006,max_year=2018,write_csv=True):
    """
    Given a screen name, get all of their tweets and replies between the min_year and max_year
    Returns a DataFrame containing all the tweets in the time range
    """
    min_year = int(min_year)
    max_year = int(max_year)
    tweet_df_list = []
    for year in range(min_year,max_year+1):
        try:
            tweets, replies = tweet_grabber(screen_name,year,False)
            # Convert to DataFrames
            tweets_df = pd.DataFrame(tweets)
            replies_df = pd.DataFrame(replies)
            combined_df = pd.concat([tweets_df,replies_df])
            # Store in the collection for future cleanup
            tweet_df_list.append(combined_df)
        except KeyboardInterrupt:
            raise
        except:
            print("Error on {0} in {1}".format(screen_name,year))
            pass
    # Concatenate and cleanup all the DataFrames
    all_tweets_df = pd.concat(tweet_df_list)
    all_tweets_df = all_tweets_df.drop_duplicates(subset=['id'])
    # Convert to real timestamps to sort all tweets by time
    all_tweets_df['timestamp'] = all_tweets_df['formatted_date'].apply(lambda g: datetime.strptime(g, '%a %b %d %H:%M:%S +%f %Y'))
    all_tweets_df = all_tweets_df.sort_values('timestamp').reset_index(drop=True)
    if write_csv:
        # Write the tweet history to disk
        all_tweets_df.to_csv('all_tweets_{0}.csv'.format(screen_name),index=False)
    return all_tweets_df
Get the tweets for each candidate. This code is "commented" out to prevent it from being inadvertently run because it takes on the order of 2 hours to complete.
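The commented-out cell presumably resembled the sketch below; the screen names are assumptions, and uncommenting it re-runs the slow, ToS-violating scrape.
In [ ]:
# Scrape each candidate's full timeline with tweet_grabbing_factory (~2 hours total)
# trump_all_tweets_df = tweet_grabbing_factory('realDonaldTrump')
# obama_all_tweets_df = tweet_grabbing_factory('BarackObama')
# clinton_all_tweets_df = tweet_grabbing_factory('HillaryClinton')
# romney_all_tweets_df = tweet_grabbing_factory('MittRomney')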
Using the python-twitter library and API developer keys, we can "hydrate" the tweets from the official Twitter API to get the full payload of information.
In [ ]:
import twitter
from bs4 import BeautifulSoup
api = twitter.Api(consumer_key = 'get your own',
consumer_secret = 'get your own',
access_token_key = 'get your own',
access_token_secret = 'get your own',
tweet_mode = 'extended',
sleep_on_rate_limit = True
)
def chunks(l, n):
    # https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]
def hydrate_tweets(df,column='id'):
    tweets = []
    for chunk in chunks(df[column].tolist(),100):
        _statuses = api.GetStatuses(chunk)
        _statuses = [t.AsDict() for t in _statuses]
        tweets += _statuses
    return tweets
Hydrate the tweets for each candidate. This code is "commented" out to prevent it from being inadvertently run because it takes on the order of 5 minutes to complete.
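Again, the commented-out cell likely looked something like the sketch below, passing each candidate's scraped tweets through hydrate_tweets; the *_all_tweets_df inputs are the hypothetical outputs of tweet_grabbing_factory above.
In [ ]:
# Hydrate each candidate's scraped tweet IDs through the official API (~5 minutes)
# trump_hydrated_tweets = hydrate_tweets(trump_all_tweets_df)
# obama_hydrated_tweets = hydrate_tweets(obama_all_tweets_df)
# clinton_hydrated_tweets = hydrate_tweets(clinton_all_tweets_df)
# romney_hydrated_tweets = hydrate_tweets(romney_all_tweets_df)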
Write the hydrated data to disk.
In [ ]:
for filename,hydrated_tweets in zip(['raw_tweets_obama.json','raw_tweets_clinton.json','raw_tweets_romney.json','raw_tweets_trump.json'],[obama_hydrated_tweets,clinton_hydrated_tweets,romney_hydrated_tweets,trump_hydrated_tweets]):
    with open(filename,'w') as f:
        json.dump(hydrated_tweets,f)
The clean_tweets function reduces each hydrated tweet to a simpler data structure.
In [ ]:
def clean_tweets(tweet_list):
    cleaned_tweets = []
    for i,t in enumerate(tweet_list):
        cleaned_t = {}
        cleaned_t['timestamp'] = t.get('created_at',np.nan)
        cleaned_t['date'] = datetime.strftime(pd.Timestamp(cleaned_t['timestamp']),'%Y-%m-%d')
        cleaned_t['favorites'] = t.get('favorite_count',np.nan)
        cleaned_t['text'] = t.get('full_text',np.nan)
        cleaned_t['hashtags'] = ', '.join([h.get('text',np.nan) for h in t['hashtags']])
        cleaned_t['tweet_id'] = t.get('id_str',np.nan)
        cleaned_t['retweets'] = t.get('retweet_count',np.nan)
        cleaned_t['screen_name'] = t['user'].get('screen_name',np.nan)
        cleaned_t['mentions'] = ', '.join([u.get('screen_name',np.nan) for u in t['user_mentions']])
        cleaned_t['source'] = BeautifulSoup(t.get('source',''),'html.parser').text
        # This is where the sentiment analysis magic happens
        cleaned_t['sentiment'] = analyzer.polarity_scores(cleaned_t['text'])['compound']
        cleaned_tweets.append(cleaned_t)
    return cleaned_tweets
Apply the clean_tweets function to the hydrated tweets for each candidate, then write the cleaned data to disk.
In [ ]:
obama_cleaned_tweets = clean_tweets(obama_hydrated_tweets)
clinton_cleaned_tweets = clean_tweets(clinton_hydrated_tweets)
romney_cleaned_tweets = clean_tweets(romney_hydrated_tweets)
trump_cleaned_tweets = clean_tweets(trump_hydrated_tweets)
for filename,cleaned_tweets in zip(['cleaned_tweets_obama.json','cleaned_tweets_clinton.json','cleaned_tweets_romney.json','cleaned_tweets_trump.json'],[obama_cleaned_tweets,clinton_cleaned_tweets,romney_cleaned_tweets,trump_cleaned_tweets]):
    with open(filename,'w') as f:
        json.dump(cleaned_tweets,f)